Exploring a local area#

Having a global dataset is a great start, but how do we get insights about a local venue? You can look at the Climate TRACE map as a first start, but maybe you also want to display and extract information in your own way. This notebook shows you:

  • how to extract insights for a specific area

  • how to combine information about point and area sources in a single view

Exploring geographical data (also known as geospatial analysis) usually requires specific tools and a more complex setup, such as a database that can quickly identify points in a region. Our very efficient data representations and combined with modern tools such as DuckDB effectively alleviate the need for a complex setup.

Like Polars, DuckDB is a very efficient data-processing engine. It is focused on the SQL language rather than Python, and it provides geospatial primitives.

%load_ext autoreload
%autoreload 2
import logging
logging.basicConfig(level=logging.INFO)
#import os
import polars as pl
import duckdb
import geopandas
import plotly.graph_objects as go
import plotly.io
plotly.io.templates.default = "plotly_white"
import plotly.express as px
import numpy as np
import os

from ctrace.constants import *
import ctrace as ct
# Load all the geospatial extensions of DuckDB.
ct.data._ensure_duckdb()

Let’s load first all the points and geometries. It is small enough that we can load everything. If you do not have enough memory, you can filter to a specific country for example.

all_geoms_pdf = ct.read_polygons()
all_points_pdf = ct.read_points()
# points_path = "/tmp/climate_trace-points_v3-2024-ct4.parquet"
# polys_path = "/tmp/climate_trace-polygons_v3-2024-ct4.parquet"
# all_points_pdf = pl.read_parquet(points_path)
# all_geoms_pdf = pl.read_parquet(polys_path)

Create country-specific data#

country = "DEU"
years = [2022, 2023, 2024]
gas = CO2

c_year = pl.col("year")

gadm_biz_data = (ct.read_source_emissions(
    gas=gas,
    year=years
).with_columns(c_start_time.dt.year().alias("year"))
 .filter(c_iso3_country == country)
 .group_by(c_source_id, c_gas, c_year)
 .agg(c_geometry_ref.first(),
      c_emissions_quantity.sum(),
      c_sector.first(),
      c_subsector.first(),
      c_source_name.first(),
      c_lat.first(),
      c_lon.first())
 .filter(c_emissions_quantity != 0)
 .sort(by=[c_emissions_quantity.abs(), -c_year], descending=True)
 .collect()
)

gadm_biz_data
shape: (15_807, 10)
source_idgasyeargeometry_refemissions_quantitysectorsubsectorsource_namelatlon
u64enumi32strf64enumenumstrf64f64
1242030"co2"2024"gadm_DEU"9.5487e7"buildings""residential-onsite-fuel-usage""Germany"51.26562610.381492
1242030"co2"2023"gadm_DEU"9.5487e7"buildings""residential-onsite-fuel-usage""Germany"51.26562610.381492
1242030"co2"2022"gadm_DEU"9.4883e7"buildings""residential-onsite-fuel-usage""Germany"51.26562610.381492
8472160"co2"2023"gadm_DEU"-7.9143e7"forestry-and-land-use""removals""Germany"51.26562610.381492
8472160"co2"2024"gadm_DEU"-7.9143e7"forestry-and-land-use""removals""Germany"51.26562610.381492
37510953"co2"2024"trace__"0.014488"forestry-and-land-use""net-wetland""Siegen"nullnull
4613561"co2"2022"gadm_DEU.10.35_1"0.008565"forestry-and-land-use""net-wetland""Oberhausen"51.5131316.84712
10935642"co2"2023"gadm_DEU.1.15_1"0.004153"agriculture""cropland-fires""Heidelberg"49.4055158.694172
10935642"co2"2022"gadm_DEU.1.15_1"0.004153"agriculture""cropland-fires""Heidelberg"49.4055158.694172
10935642"co2"2024"gadm_DEU.1.15_1"0.004153"agriculture""cropland-fires""Heidelberg"49.4055158.694172
gadm_source_fname = f"/tmp/app_gadm_sources_{country}_{ct.data.version}.parquet"
gadm_biz_data.write_parquet(
    gadm_source_fname,
    row_group_size=1_000_000,
    compression="lz4",
    statistics=True,
)
app_ctry_polys_fname = f"/tmp/app_polys_{country}_{ct.data.version}.parquet"
(all_geoms_pdf.filter(c_iso3_country==country)
 .sink_parquet(
    app_ctry_polys_fname,
    row_group_size=1_000_000,
    compression="lz4",
    statistics=True,
))

Upload country-specific data to Hugging Face#

import huggingface_hub.utils
upload = True
if upload:
    try:
        api = huggingface_hub.HfApi()
        for fpath in [gadm_source_fname, app_ctry_polys_fname]:
            fname = os.path.basename(fpath)
            fname = f"app/{fname}"
            print(fname, fpath)
            api.upload_file(
                path_or_fileobj=fpath,
                path_in_repo=fname,
                repo_id="tjhunter/climate-trace",
                repo_type="dataset",
            )
    except huggingface_hub.utils.HfHubHTTPError as e:
        print("error")
        print(e)
app/app_gadm_sources_DEU_v3-2024-ct4.parquet /tmp/app_gadm_sources_DEU_v3-2024-ct4.parquet
app/app_polys_DEU_v3-2024-ct4.parquet /tmp/app_polys_DEU_v3-2024-ct4.parquet

Map a location to a Climate TRACE administrative region#

We now locate a point of interest on the Earth. This little widget allows you to type in an address, and it will return the position (in latitude and longitude). In this case, it returns the location of my home city in the Netherlands, on the island of Texel.

#lat,lng=52.1462902,4.3977197  # Wassenaar
lat,lng=53.0547292,4.7374957 # Texel
#lat,lng=35.6739875,51.2670981 # Teheran
#lat,lng=52.2114174,5.9341975 # Appeldoorn
#lat,lng=51.3437084,6.9354258 # Isenbugel
#lat,lng=37.8711563,-122.3426373 # Berkeley

Let’s go for some processing. Given this location, we first find the administrative region that contains this polygon. The following code calls DuckDB to identify the bounding administrative regions, and then converts them to a plottable object (a geopandas dataframe).

We also find the code name of the administrative region, because it will be used to find all the relevant information about this region. In this case, the codename for the island of Texel is ‘gadm_NLD.14.84_1’.

Keen readers will notice that this code employs brute force and checks our point against every single polygon. This is fast enough for this simple application. For repeated queries, DuckDB has an indexing feature.

# Take the lowest level in GADM
poly_geoms = (duckdb.sql(f"""
SELECT geometry_ref, gadm_level, geom_wkb FROM
(
    SELECT *, ST_GeomFromWKB(geom_wkb) AS geom FROM all_geoms_pdf
    WHERE ST_WITHIN(ST_POINT({lng},{lat}), geom)
)
""")
    .pl() # Convert to Polars dataframe
    .sort(by="gadm_level").tail(1) # Take the lowest level available
)
# Convert to geopandas
poly_geoms_pdf = geopandas.GeoDataFrame(
    poly_geoms.drop(["geom_wkb"]).to_pandas(),
    geometry=geopandas.GeoSeries.from_wkb(poly_geoms['geom_wkb']),
    crs="EPSG:4326")

# The reference code for the administrative region
gadms_ref = set(poly_geoms[GEOMETRY_REF].to_list())
print(f"Administrative region: {gadms_ref}")
Administrative region: {'gadm_NLD.9.48_1'}

Similarly, we find all the point references that are contained in the region. For my region, there are a few points of interest. These are the geographical references of the point sources. We will match them later against the emission data.

points_ref = set((all_points_pdf
 .filter(pl.col("gadm").is_in(gadms_ref))).collect()[GEOMETRY_REF].to_list())
points_ref
{'trace_4.764956128477592_52.9858927909357',
 'trace_4.776_53.078',
 'trace_4.8525_53.0425'}

And here is our zone of interest: the island of Texel, Netherlands.

def center(gpdf):
    [lng1,lat1,lng2,lat2] = gpdf["geometry"].total_bounds
    return {"lat": 0.5 * (lat1 + lat2), "lon": 0.5 * (lng1+lng2)}

px.choropleth_map(poly_geoms_pdf,
    geojson=poly_geoms_pdf.geometry,
    locations=poly_geoms_pdf.index,
    color="gadm_level",
    zoom=7,
    center = center(poly_geoms_pdf),
    opacity=0.5)
fig = go.Figure(go.Choroplethmapbox(
    geojson=poly_geoms_pdf.__geo_interface__,
    locations=poly_geoms_pdf.index,
    z=poly_geoms_pdf.gadm_level,
    colorscale="Viridis",
    zmin=0, zmax=12,
    marker_opacity=0.3,
    marker_line_width=3,
    showlegend=False,
    showscale=False
    ))
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=8, mapbox_center = center(poly_geoms_pdf))
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

Collect all recent emissions for the area#

We now extract and display information about the area itself: for each source located within the area of interest, we collect the emission information over the last 3 years, for CO2.

Again, for monthly data, we just sum up over the years.

gadm_biz_data2 = (pl.scan_parquet(gadm_source_fname)
.filter(c_geometry_ref.is_in(gadms_ref.union(points_ref)))
.collect())
gadm_biz_data2
shape: (33, 10)
source_idgasyeargeometry_refemissions_quantitysectorsubsectorsource_namelatlon
u64enumi32strf64enumenumstrf64f64
1932755"co2"2023"trace_4.8525_53.0425"53773.689221"transportation""domestic-shipping""Oudeschild"53.04254.8525
1932755"co2"2024"trace_4.8525_53.0425"44156.570309"transportation""domestic-shipping""Oudeschild"53.04254.8525
1932755"co2"2022"trace_4.8525_53.0425"43859.302798"transportation""domestic-shipping""Oudeschild"53.04254.8525
3612706"co2"2024"gadm_NLD.9.48_1"26015.670138"transportation""road-transportation""Texel"53.0686994.752617
3612706"co2"2023"gadm_NLD.9.48_1"24964.249903"transportation""road-transportation""Texel"53.0686994.752617
4699123"co2"2023"gadm_NLD.9.48_1"-96.0185"forestry-and-land-use""net-wetland""Texel"53.0686994.752617
4699123"co2"2024"gadm_NLD.9.48_1"-96.0185"forestry-and-land-use""net-wetland""Texel"53.0686994.752617
11173590"co2"2023"gadm_NLD.9.48_1"1.54798"agriculture""cropland-fires""Texel"53.0686994.752617
11173590"co2"2024"gadm_NLD.9.48_1"1.54798"agriculture""cropland-fires""Texel"53.0686994.752617
11173590"co2"2022"gadm_NLD.9.48_1"1.54798"agriculture""cropland-fires""Texel"53.0686994.752617
(gadm_biz_data.filter(c_geometry_ref.is_in(gadms_ref))[SOURCE_NAME].unique().to_list())
['Texel']
# TODO: this is rather expensive, check the query plan can be optimized
# There is an issue with the sectors and subsectors enumerations that cause a panic in Polars.
c_year = pl.col("year")
gadm_biz_data = (ct.read_source_emissions(
    gas=CO2,
    year=[2022,2023,2024]
).with_columns(c_start_time.dt.year().alias("year"))
 .group_by(c_source_id, c_gas, c_year)
 .agg(c_geometry_ref.first(),
      c_emissions_quantity.sum(),
      c_sector.first(),
      c_subsector.first(),
      c_source_name.first(),
      c_lat.first(),
      c_lon.first())
 .filter(c_geometry_ref.is_in(gadms_ref.union(points_ref)))
 .filter(c_emissions_quantity != 0)
 .sort(by=[c_emissions_quantity.abs(), -c_year], descending=True)
 .collect()
)
gadm_biz_data #.head(5)

Let’s look at the year 2024 for CO2. We see some interesting things for the island of Texel:

  • domestic-shipping is the dominant source of emissions. It is the ferry that links up the island with the Den Helder harbour in North Holland, as well as other industrial ships providing raw materials.

  • road transportation (cars) and residential-onsite-fuel-usage (house heating) are the other dominant sources. These sources can be changed through local action, for example for better insulation of houses, more public transit, more electrical cars or bicycles.

  • the natural side (net-forest-land, removals) is not really contributing much, it is actually emitting in this case. Take these number with caution though, as they are so low that they could be within the margin of errors. In any case, even if the island of Texel has forests, they do not seem to balance out all the emissions from the human activities on the island.

Also, it should be repeated: these are the direct emissions linked to a territory. All the food produced outside the island and sold to the inhabitants will not be accounted for here. Indirectly, the island may have up to 50% more emissions when accounting all the imported goods that triggered emissions elsewhere.

px.bar(
gadm_biz_data.filter(c_year == 2024).sort(by=-c_emissions_quantity),
    x="subsector",
    y="emissions_quantity",
    color="sector"
)

How is it evolving over the years?

The island of Texel has forests, which offset out (somewhat) the CO2 emissions. The big jumps in the forestry sector from absorbing to emitting should be treated cautiously, as these numbers are really close to the margins of errors for the forestry sector.

px.bar(
gadm_biz_data.sort(by=-c_emissions_quantity),
    x="year",
    y="emissions_quantity",
    color="sector"
)

Finally, let’s look at the map where the point sources are located. Hover over the points in the map to get more information.

In the case of Texel, there are only two significant point sources: the harbours. They are themselves the aggregation of all the trips made by vessels to the island. In the climate TRACE dataset, airplanes and boats behave as though half of their trips are spent on the departing harbour and then the remaining half on the destination harbour with nothing in between. This is a standard assumption when assigning emissions.

def normalize_emissions_plot(s):
    """A small technical function to have good-looking scatter plots"""
    z = s.to_numpy()
    z = z - min(z) + 1
    z = z ** 0.3
    z = z / max(z)
    z = z * 0.5
    return z


fig = px.choropleth_map(poly_geoms_pdf,
    geojson=poly_geoms_pdf.geometry,
    locations=poly_geoms_pdf.index,
    color="gadm_level",
    zoom=7,
    center = center(poly_geoms_pdf),
    opacity=0.2)

points_biz_data = gadm_biz_data.filter(c_geometry_ref.is_in(points_ref))
fig.add_traces(px.scatter_map(points_biz_data,
               lat="lat",
                lon="lon",
               color="sector",
               hover_data=["source_id", "source_name", "emissions_quantity"],
               text="source_name",
               size = normalize_emissions_plot(points_biz_data[EMISSIONS_QUANTITY])
              ).data
)
fig.update_layout(showlegend=False)
fig.show()
fig2 = go.Figure()
fig2.add_trace(px.scatter_mapbox(points_biz_data,
               lat="lat",
                lon="lon",
               color="sector",
               hover_data=["source_id", "source_name", "emissions_quantity"],
               text="source_name",
               size = normalize_emissions_plot(points_biz_data[EMISSIONS_QUANTITY])
              ).data[0]
)
fig2.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=8, mapbox_center = center(poly_geoms_pdf))
fig2.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig2.update_layout(showlegend=False)

Nearby sources#

We may be also interested in seeing all the sources in our neighborhood. Let us look now at all the sources within 80km of our region (you can adjust this threshold below).

As before, the following two queries:

  1. find all the geographical references to points within the radius of interest

  2. match the geographical references to emission data

neighbor_point_geoms = duckdb.sql(f"""
SELECT geometry_ref, gadm, lat, lng, ST_Distance_Spheroid(ST_POINT({lng},{lat}), ST_POINT(lat, lng)) AS distance FROM (
    SELECT *, ST_GeomFromWKB(geom_wkb) AS geom FROM all_points_pdf
        WHERE ST_Distance_Spheroid(ST_POINT({lng},{lat}), geom) < 80_000
) ORDER BY distance
""").pl()
neighbor_point_geoms.count()
c_distance = pl.col("distance")
neighbor_point_geom_refs = neighbor_point_geoms[GEOMETRY_REF].to_list()

neighbor_point_biz_data = (ct.read_source_emissions(
    gas=CO2,
    year=2024
)
 # Collecting over the full year
 .group_by(c_source_id)
 .agg(c_geometry_ref.first(),
      c_emissions_quantity.sum(),
      c_sector.first(),
      c_subsector.first(),
      c_source_name.first(),
      c_lat.first(),
      c_lon.first())
 # Filtering to the interesting points
 .filter(c_geometry_ref.is_in(neighbor_point_geom_refs)) # TODO: Polars issue, should move before agg
 .filter(~c_geometry_ref.is_in(points_ref)) # Remove the points within the area
 .filter(c_emissions_quantity != 0)
 .collect()
)
neighbor_point_biz_data.head(3)

Here is a plot of all the sources within the vicinity. For our island, we see that transportation dominates (harbours) as well as a big steel plant next to Amsterdam (Tata Steel).

fig = px.choropleth_map(poly_geoms_pdf,
    geojson=poly_geoms_pdf.geometry,
    locations=poly_geoms_pdf.index,
    color="gadm_level",
    zoom=6,
    center = center(poly_geoms_pdf),
    opacity=0.2)


fig.add_traces(px.scatter_map(neighbor_point_biz_data,
               lat="lat",
                lon="lon",
               color="sector",
               hover_data=["source_id", "source_name", "emissions_quantity"],
               text="source_name",
               size = normalize_emissions_plot(neighbor_point_biz_data[EMISSIONS_QUANTITY])
              ).data
)
fig.update_layout(showlegend=False)
fig.show()

Conclusion#

In this notebook, we have:

  • mapped a point on earth to the relevant area code in the Climate TRACE dataset

  • made some basic analysis of the footprint of this area

  • queried all the sources of emissions within this area as well as around this area

This was done just using the datasets, with no extra tools or databases and at high performance.

Exercise

I encourage you to explore various places around the earth and to investigate the sources of emissions around your neighborhood.

Exercise

Which city/county is the least (or most) emitting in your country?